44 research outputs found

Direct $N$-body code on low-power embedded ARM GPUs

Author: AR Brodtkorb
E Bortolas
F Perez
J Hunter
K Nitadori
M Katevenis
M Spera
R Capuzzo-Dolcetta
S Harfst
S Konstantinidis
S van der Walt
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/01/2019

This work arises in the environment of the ExaNeSt project, which aims at the design and development of an exascale-ready supercomputer with a low energy consumption profile that is still able to support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors, embedded within Xilinx FPGA SoCs. SoC boards are heterogeneous architectures where computing power is supplied by both CPUs and GPUs, and they are emerging as a possible low-power and low-cost alternative to clusters based on traditional CPUs. A state-of-the-art direct $N$-body code suitable for astrophysical simulations has been re-engineered in order to exploit SoC heterogeneous platforms based on ARM CPUs and embedded GPUs. Performance tests show that embedded GPUs can be effectively used to accelerate real-life scientific calculations, and that they are also promising because of their energy efficiency, which is a crucial design factor in future exascale platforms.

Comment: 16 pages, 7 figures, 1 table; accepted for publication in the Computing Conference 2019 proceedings

arXiv.org e-Print Archive

Crossref

Extending promela and spin for real time

Author: D. Dill
Dolev
L. Lamport
M. Katevenis
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Reduced instruction set computer architectures for VLSI

Author: Katevenis M G H
Publication venue: The MIT Press
Publication date: 01/01/1985

CERN Document Server

Credit-Flow-Controlled ATM versus Wormhole Routing

Author: D. Serpanos
E. Spyridakis
M. Katevenis

ATM has been adopted as the main high-speed technology in both wide and local area networks. When ATM is combined with credits, the flow control mechanism that is particularly suitable for local data communication, it becomes appropriate for multiprocessor interconnection networks as well. Actually, credit-flow-controlled ATM has similarities with wormhole routing, one of the most popular architectures for MP networks: they both use credits and fixed-size cells/flits, and their hardware complexity is comparable. In this paper, we show that ATM with credits performs better than wormhole routing, because ATM uses lanes more efficiently: ATM provides high throughput and low latency with much less buffer space than that required by wormhole routing; also, ATM demonstrates little sensitivity to bursty traffic, and, unlike wormhole, it is fair in terms of latency in hot-spot configurations. Our simulation uses detailed and realistic switch models, which operate at clock-cycle granularity and track ..

CiteSeerX

Pipelined memory shared buffer for VLSI switches

Author: Aristides Efthymiou
Homewood M.
Karol M.
Katevenis M.
Manolis Katevenis
Panagiota Vatsolaki
Souza R.
Turner J.
Weste N.
Yeh Y.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Reduced instruction set computers

Author: David A. Patterson
Hopkins M.
Katevenis M.G.H.
Radin G.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

VISA: A variable instruction set architecture

Author: Alessandro De Gloria
Ellis J.R.
Katevenis M.
Patterson D.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

A systematic evaluation of emerging mesh-like CMP NoCs

Author: Chaix F.
Chrysos N.
Katevenis M.
Papaefstathiou Vasileios
Pnevmatikatos D.
Psathakis A.
Vasilakis E.
Publication date: 01/01/2015

This paper studies alternative Network-on-Chip architectures for emerging many-core chip multiprocessors, by exploring the following design options on mesh-based networks: multiple physical networks (P), core concentration (C), express channels (X), flit widths (W), and virtual channels (V). We exhaustively evaluate all combinations of the aforementioned parameters (P, C, X, W, V), using the energy-throughput ratio (ETR) as a metric to classify network configurations. Our experimental results show that, on one hand, with an appropriate selection of parameters (V, W), an optimized baseline 2D mesh offers the best possible ETR for NoCs with up to a few tens of cores (64-core NoC). More complicated networks, using concentration and express channels, can reduce the zero-load latency, but do not necessarily help to improve ETR. On the other hand, for larger CMPs, a 2D mesh with multiple physical networks is a better option: once optimized, this architectural choice can reduce the ETR by up to 46% for 256 cores

Chalmers Research

Chalmers Publication Library

Receive-Side Notification for Enhanced RDMA in FPGA Based Networks

Author: C Concatto
K Ovtcharov
KD Underwood
M Katevenis
RE Grant
T El-Ghazawi
WJ Dally
Publication date: 14/02/2019

FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how to better support their communications, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming too much FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. This way, it provides completion notifications directly to the remote node, which saves a round-trip latency. The entire mechanism is designed to sit within the fabric of the FPGA, requiring no software intervention. Our solution is able to reduce the latency of a receive operation by around 20% for small message sizes (4KB) over a single hop (longer distances would experience even higher improvement). Results from synthesis over a wide parameter range confirm that this optimization is scalable both in terms of the number of concurrent outstanding RDMA operations and the maximum message size

Crossref

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The University of Manchester - Institutional Repository

Synchronization support in I/O adapter based SCI clusters

Author: E.W. Dijkstra
K. Omang
L. Lamport
M. Katevenis
Publication venue: 'Springer Science and Business Media LLC'

Crossref